Nikolaus Augsten Approximate Matching of Hierarchical Data

نویسنده

  • Nikolaus Augsten
چکیده

The goal of this thesis is to design, develop, and evaluate new methods for the approximate matching of hierarchical data represented as labeled trees. In approximate matching scenarios two items should be matched if they are similar. Computing the similarity between labeled trees is hard as in addition to the data values also the structure must be considered. A well-known measure for comparing trees is the tree edit distance. It is computationally expensive and leads to a prohibitively high run time. Our solution for the approximate matching of hierarchical data are pqgrams. The pq-grams of a tree are all its subtrees of a particular shape. Intuitively, two trees are similar if they have many pq-grams in common. The pq-gram distance is an efficient and effective approximation of the tree edit distance. We analyze the properties of the pq-gram distance and compare it with the tree edit distance and alternative approximations. The pq-grams are stored in the pq-gram index which is implemented as a relation and represents a pq-gram by a fixed-length string. The pq-gram index offers efficient approximate lookups and reduces the approximate pq-gram join to an equality join on strings. We formally proof that the pq-gram index can be incrementally updated based on the log of edit operations without reconstructing intermediate tree versions. The incremental update is independent of the data size and scales to a large number of changes in the data. We introduce windowed pqgrams for the approximate matching of unordered trees. Windowed pq-grams are generated from sorted trees in linear time. Sorting trees is not possible for common ordered tree distances such as the edit distance. We present the address connector, a new system for synchronizing residential addresses that implements a pq-gram based distance between streets, introduces a global greedy matching that guarantees stable pairs, and links addresses that are stored with different granularity. The connector has been successfully tested with public administration databases. Our extensive experiments on both synthetic and real world data confirm the analytic results and suggest that pq-grams are both useful and scalable.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Approximate Matching of Hierarchical Data Using pq-Grams

When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are ...

متن کامل

Clustering with Proximity Graphs: Exact and Efficient Algorithms

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster a so-called proximity graph is computed, and the cluster border is detected based on the proximity graph. However, the computation of the proximity graph is expensive and the state-of-the-art GPC algorithms on...

متن کامل

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

متن کامل

An Improved Semantic Schema Matching Approach

Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Process (OLAP), Data mining, semantic web [2] and schema integration. This task is defined for finding the semantic correspondences between elements of two schemas. Recently, schema matching has found considerable interest in both research and practice. In this paper, we present a new impr...

متن کامل

Hierarchical Bloom Filter Trees for Approximate Matching

Bytewise approximate matching algorithms have in recent years shown significant promise in detecting files that are similar at the byte level. This is very useful for digital forensic investigators, who are regularly faced with the problem of searching through a seized device for pertinent data. A common scenario is where an investigator is in possession of a collection of “known-illegal” files...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008